Manticore: A heterogeneous parallel language 

http://manticore.cs.uchicago.edu


Matthew Fluet 

Toyota Technological Institute at Chicago 

fluet@tti-c.org 


Abstract 

The Manticore project is an effort to design and implement a 
new functional language for parallel programming. Unlike many
earlier parallel languages, Manticore is a heterogeneous language
that supports parallelism at multiple levels. Specifically, we com- 
bine CML-style explicit concurrency with NESL/Nepal-style data- 
parallelism. In this paper, we describe and motivate the design of 
the Manticore language. We also describe a flexible runtime model 
that supports multiple scheduling disciplines (e.g., for both fine-grain
and coarse-grain parallelism) in a uniform framework. Work
on a prototype implementation is ongoing and we give a status re- 
port. 

1. Introduction 

We believe that existing general-purpose languages do not provide 
adequate support for parallel programming, while existing paral- 
lel languages, which are largely targeted at scientific applications, 
do not provide adequate support for general-purpose programming. 
This state of affairs must change. The laws of physics and the lim- 
itations of instruction-level parallelism are forcing microproces- 
sor architects to develop new multicore processor designs, which 
means that parallel computing is coming to commodity hardware. 
We need new languages to maximize application performance on 
these new processors. 

Our thesis is that parallel languages must provide mechanisms 
for multiple levels of parallelism, both because applications exhibit 
parallelism at multiple levels and because the hardware requires 
parallelism at multiple levels to maximize performance. For ex- 
ample, consider a networked flight simulator. Such an application 
might use data-parallel computations for particle systems [Ree83] 
to model natural phenomena such as rain, fog, and clouds. At the 
same time it might use parallel threads to preload terrain and com- 
pute level-of-detail refinements, and use SIMD parallelism in its 
physics simulations. The same application might also use explicit 
concurrency for user interface and network components. Programming
such applications will be challenging without language support
for parallelism at multiple levels.

This paper describes a new research project at the University of 
Chicago and TTI-C addressing the topic of language design and 
implementation for multicore processors. Our emphasis is on ap- 
plications that might run on commodity processors, such as mul- 
timedia processing, computer games, small-scale simulations, etc. 
These applications can exhibit parallelism at multiple levels with 
different granularities, which means that a homogeneous approach 
will not take advantage of all of the hardware resources. A language 
that provides data parallelism but not explicit concurrency will be 
inconvenient for the development of the networking and GUI components
of a program. On the other hand, a language that provides
concurrency but not data parallelism will be ill-suited for compo- 
nents of a program that demand fine-grain SIMD parallelism, such 
as image processing and particle systems. Instead, we propose a 
heterogeneous parallel language, called Manticore, that combines 
support for parallel computation at different levels into a common 
linguistic and execution framework. 

The Manticore language is rooted in the family of statically-typed
strict functional languages such as OCaml and SML. We make this
choice because functional languages emphasize a value-oriented and
mutation-free programming model, which avoids entanglements between
separate concurrent computations [Ham91, Rep91, JH93, NA01]. We choose
a strict language, rather than a
lazy or lenient one, because we believe that strict languages are 
easier to implement efficiently and accessible to a larger commu- 
nity of potential users. On top of the sequential base language, 
Manticore provides the programmer with mechanisms for explicit 
concurrency and coarse-grain parallelism and mechanisms for fine- 
grain parallelism. 

Manticore’s concurrency mechanisms are based on Concurrent 
ML (CML) [Rep99], which provides support for threads and syn- 
chronous message passing. Although CML was not designed with 
parallelism in mind (in fact, its original implementation is inher- 
ently not parallel), we believe that it will provide good support for 
coarse-grain parallelism. In this respect, Manticore is similar to
Erlang, which has a mutation-free sequential core with message
passing [AVWW96]. Erlang has parallel implementations [Hed98], but
no support for fine-grain parallel computation. Manticore's support 
for fine-grain parallelism is influenced by previous work on nested 
data-parallel languages, such as NESL [BCH+94, Ble96, BG96]
and Nepal [CK00, CKLP01, LCK06]. From these languages, Manticore
adopts parallel arrays and parallel-array comprehensions.

In addition to language design, we are exploring a unified runtime
substrate for Manticore that can handle the disparate demands
of explicit concurrency and various fine-grain parallel programming
mechanisms. This substrate will provide a foundation for
rapidly experimenting with alternative parallelism mechanisms. We 
have also been developing techniques for implementing CML’s 
message-passing operations in a multiprocessor setting. This work 
includes new protocols for the operations as well as program analysis
and optimization techniques to improve the performance of
message-passing programs [RX07].

2. The Manticore language 

As noted above, the Manticore language provides the programmer 
with both explicit mechanisms for concurrency and coarse-grain 
parallelism and implicit mechanisms for fine-grain parallelism. 1 
For concurrency and coarse-grain parallelism, explicit mechanisms 
can be an effective technique, but for fine-grain parallelism they 
are burdensome to the programmer and may impose excessive 
overhead. In this section, we sketch the major features of our 
language design, demonstrate how different levels of parallelism 
may be used in a simple example, and discuss possible future 
directions for the design. 

Briefly, the design of the Manticore language combines three 
distinct components: a sequential base language using functional 
programming features, drawn from a (large) subset of SML; ex- 
plicit concurrent programming mechanisms using threads and syn- 
chronous message passing, drawn from CML [Rep91, Rep99]; 
and implicit parallel programming mechanisms using nested data-parallel
constructs, drawn from NESL [Ble96] and Nepal [CKLP01].

In the sequential base language (and, by extension, the Man- 
ticore language as a whole), we include important features from 
SML, such as datatypes, polymorphism, type inference, and higher- 
order functions, but simplify the design by supporting only a sim- 
ple module system and by removing a number of non-essential el- 
ements. Most importantly, we remove mutable reference and ar- 
ray types, so the concurrency mechanisms drawn from CML are 
the only stateful operations in Manticore. 2 As many researchers 
have observed, using a mutation-free computation language greatly 
simplifies the implementation and use of parallel features [Ham91, 
Rep91, JH93, NA01, DG04]. In essence, successful parallel languages
rely on notions of separation; mutation-free functional programming
gives data separation for free.

The explicit concurrent programming mechanisms presented 
in Manticore serve two purposes: they support concurrent programming,
which is an important feature for systems programming
[HJT+93], and they support explicit parallel programming.
Like CML, Manticore supports threads that are explicitly created 
using the spawn primitive. Threads do not share mutable state; 
rather they use synchronous message passing over typed channels 
to communicate and synchronize. Additionally, we use CML com- 
munication mechanisms to represent the interface to imperative 
features such as input/output. 

The main intellectual contribution of CML’s design is an abstraction
mechanism, called first-class synchronous operations, for
building synchronization and communication abstractions. This 
mechanism allows programmers to encapsulate complicated com- 
munication and synchronization protocols as first-class abstrac- 
tions, called event values, which encourages a modular style of 
programming where the actual underlying channels used to communicate
with a given thread are hidden behind data and type abstraction.
Events can range from simple message-passing operations to
client-server protocols to protocols in a distributed system.

1 We classify parallelism/concurrency mechanisms as either explicit, where
the programmer manages thread creation, or implicit, where the compiler
and runtime system manage thread creation.

2 Note that we do not describe the sequential language as side-effect free,
since it still supports exceptions.

CML has been used successfully in a number of systems, including
a multithreaded GUI toolkit [GR93], a distributed tuple-space
implementation [Rep99], a system for implementing partitioned
applications in a distributed setting [YYS+01], and a higher-level
library for software checkpointing [ZSJ06]. CML-style primitives
have also been added to a number of other languages, including
Haskell [Rus01], Java [Dem97], OCaml [Ler00], and Scheme [FF04].
We believe that this history demonstrates the effectiveness of
CML’s approach to concurrency.

At the heart of the implicit parallel programming mechanisms 
presented in Manticore are parallel arrays, which are immutable 
sequences that can be computed in parallel. An important feature of 
parallel arrays is that they may be nested (i.e., one can have parallel
arrays of parallel arrays). Furthermore, Manticore (like Nepal, but 
unlike NESL) supports parallel arrays of arbitrary types including 
arrays of floats, functions, trees, etc. Based on the parallel array 
element type, the compiler will map parallel array operations onto 
the appropriate parallel hardware (e.g., operations on parallel arrays 
of floats may be mapped onto SIMD instructions). 

Parallel array values are constructed using a parallel compre- 
hension syntax, which provides a concise description of a parallel 
computation. 3 A comprehension has the general form 

[: e | x1 in e1, ..., xn in en where p :]

where e is the expression that computes the elements of the array, 
the ei are array-valued expressions used as inputs to e, and p is an 
optional boolean-valued expression that filters the input. If the input 
arrays have different lengths, they are truncated to the length of the 
shortest input. For example, to double each positive integer in a 
given parallel array of integers nums, one would use the following 
parallel comprehension: 

[: 2 * n | n in nums where n > 0 :]

Another example is the definition of a parallel map combinator that 
maps a function across an array in parallel. 

fun mapP f xs = [: f x | x in xs :]

The computations of elements in a comprehension can themselves
be defined by comprehensions. We give an example of this pattern
below in Figure 1 and other examples can be found in Blelloch’s
work [Ble96].

Comprehensions can be used to specify both SIMD parallelism 
that is mapped onto vector hardware (e.g., Intel’s SSE instructions)
and SPMD parallelism where parallelism is mapped onto multiple 
cores. 

An important feature of parallel arrays is that they have a se- 
quential semantics, defined by mapping arrays to lists. The general 
comprehension form from above can be translated into the follow- 
ing sequential list code: 

let fun f (x1::r1, ..., xn::rn, l) =
          f (r1, ..., rn, if p then e::l else l)
      | f (_, ..., _, l) = rev l
in f (e1, ..., en, []) end

where p, e, etc., are the translated subexpressions. 
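
As a rough illustration of this sequential reading, the translation can be mimicked in Python (used here only as a stand-in for the paper's SML notation). The function name comprehension and the encoding of e and p as Python functions of the bound variables are assumptions of this sketch, not part of Manticore.

```python
def comprehension(e, inputs, p=lambda *xs: True):
    """Sequential meaning of [: e | x1 in e1, ..., xn in en where p :].

    `e` and `p` are ordinary functions of the bound variables x1..xn;
    `inputs` is the list of input sequences e1..en (hypothetical encoding).
    """
    result = []
    # zip stops at the shortest input, which matches the truncation rule.
    for xs in zip(*inputs):
        if p(*xs):              # the optional "where" filter
            result.append(e(*xs))
    return result


nums = [3, -1, 0, 4]
# Corresponds to [: 2 * n | n in nums where n > 0 :]
doubled = comprehension(lambda n: 2 * n, [nums], lambda n: n > 0)
# Two inputs of different lengths: the longer one is truncated.
sums = comprehension(lambda a, b: a + b, [[1, 2, 3], [10, 20]])
```

Because the result is built strictly left to right, any parallel evaluation strategy has to produce the same list, which is the determinism property the sequential semantics is meant to pin down.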

Having a sequential semantics is useful in two ways: it provides
the programmer with a deterministic programming model and it
formalizes the expected behavior of the compiler. Specifically,
the compiler must verify that the individual sub-computations in
a data-parallel computation do not send or receive messages before
executing the computation in parallel. Furthermore, if a
sub-computation raises an exception, the runtime code must delay
delivery of that exception until it has verified that all sequentially
prior computations have terminated. Both of these restrictions require
program analysis to implement efficiently.

3 Some implicitly parallel languages, such as SISAL [GDF+97],
Id [Nik91], and pH [NA01], allow independent computations to be executed
in parallel with no programmer annotations, but most require programmer
annotations to mark which computations are good candidates for parallel
execution. For Manticore, we have chosen the latter approach because we
believe that it will more easily coexist with the explicit parallel mechanisms.

structure GrayServer : sig
    type pixel = int * int * int   (* RGB encoding *)
    type img = [: [: pixel :] :]
    val convert : img -> img event
  end = struct
    type pixel = int * int * int
    type img = [: [: pixel :] :]
    fun rgbToG ((r, g, b) : pixel) : pixel = let
          val m = (r + g + b) div 3
          in
            (m, m, m)
          end
    fun imgToGray img =
          [: [: rgbToG pix | pix in row :]
             | row in img :]
    fun convert img = let
          val replCh = channel ()
          in
            spawn (send (replCh, imgToGray img));
            recvEvt replCh
          end
  end

Figure 1. A gray-scale converter

To demonstrate how the different concurrent- and parallel- 
programming mechanisms can be used in combination, we present 
a simple, but illustrative, example. Consider the implementation of 
a service for converting color images into gray-scale images. This 
computation is inherently data parallel, but an application may also 
want to process multiple images in parallel (e.g., if the service were 
web-based). The code in Figure 1 is a Manticore module that im- 
plements such a service. An image is represented as a parallel array 
of parallel arrays of pixels (the “[: :]” brackets double as a type
constructor). The imgToGray function converts an image by using
a nested comprehension. This conversion process is presented
to clients as an asynchronous operation (the convert function). 
When the convert function is called on an image, a new thread 
is spawned to do the conversion and an event value is returned that 
the client can later synchronize on to acquire the image. 
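
For readers more familiar with mainstream libraries, the shape of this service can be approximated in Python. This is only an analogy under stated assumptions: a Future from the standard concurrent.futures module stands in for the CML event returned by convert, a thread pool stands in for spawn, and the nested list comprehension runs sequentially rather than as a data-parallel computation. The names rgb_to_gray, img_to_gray, and _pool are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def rgb_to_gray(pixel):
    r, g, b = pixel
    m = (r + g + b) // 3          # integer average, like SML's div
    return (m, m, m)


def img_to_gray(img):
    # Nested comprehension: rows of pixels, mirroring imgToGray.
    return [[rgb_to_gray(pix) for pix in row] for row in img]


_pool = ThreadPoolExecutor(max_workers=4)   # stand-in for spawn


def convert(img):
    # Returns immediately; the caller synchronizes later via .result().
    return _pool.submit(img_to_gray, img)


pending = convert([[(30, 60, 90), (0, 0, 3)],
                   [(255, 255, 255), (10, 20, 31)]])
gray = pending.result()   # analogous to synchronizing on the event
_pool.shutdown()
```

The client is free to do other work between calling convert and asking for the result, which is the asynchronous behavior the event value provides in the Manticore version.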

This section describes a first-cut design meant to give us a base 
for exploring multi-level parallel programming. Based on experi- 
ence with this design, we plan to explore a number of different 
evolutionary paths for the language. First, we plan to explore other 
parallelism mechanisms, such as the use of futures with work stealing
[MKH90, CHRR95, BL99]. Such medium-grain parallelism
would nicely complement the fine-grain parallelism (via parallel 
arrays) and the coarse-grain parallelism (via concurrent threads) 
present in Manticore. Second, there has been significant research 
on advanced type systems for tracking effects, which we may use 
to introduce imperative features into Manticore. As an alternative to 
traditional imperative variables, we will also examine synchronous 
memory (i.e., I-variables and M-variables à la Id [Nik91]) and software
transactional memory (STM) [ST95].

3. A runtime model for Manticore 

Supporting parallelism at multiple levels poses interesting technical 
challenges for the implementation. We need a framework that can 
support both explicit parallel threads that run on a single processor
and groups of implicit parallel threads that are distributed across
multiple processors with specialized scheduling disciplines. Fur- 
thermore, we want the flexibility to experiment with new parallel 
language mechanisms that may require new scheduling disciplines. 

In this section, we describe an efficient and general runtime 
model for implementing scheduling disciplines (a more detailed 
description can be found in Rainey’s Master's paper [Rai07]). This 
model, which uses first-class continuations [Wan80, Rey93] to rep- 
resent suspended computations, provides a simple, but flexible, 
interface between the runtime system and the language imple- 
mentation. The runtime-system infrastructure supports both per- 
processor and nested schedulers. 4 As we demonstrate below, it is 
capable of supporting both explicit and implicit threading models 
in a unified framework. We present the model using SML for no- 
tational convenience, but it is actually implemented as part of the 
compiler’s internal representation. Specifically, user programs do 
not have direct access to the scheduling operations or to the under- 
lying continuation operations. 

3.1 Continuations 

Continuations are a well-known language-level mechanism for ex- 
pressing concurrency [Wan80, HFW84, Rep89, Shi97]. Continua- 
tions come in a number of different strengths or flavors. 

1. First-class continuations, such as those provided by SCHEME 
and SML/NJ, have unconstrained lifetimes and may be used 
more than once. They are easily implemented in a continuation-passing
style compiler using heap-allocated continuations [App92],
but map poorly onto stack-based implementations. 

2. One-shot continuations [BWD96] have unconstrained life- 
times, but may only be used once. The one-shot restriction 
makes these more amenable for stack-based implementations, 
but their implementation is still complicated. In practice, most 
concurrency operations (but not thread creation) can be imple- 
mented using one-shot continuations. 

3. Escaping continuations 5 have a scope-limited lifetime and can 
only be used once, but they also can be used to implement 
many concurrency operations [RP00, FR02]. These continuations
have a very lightweight implementation in a stack-based
framework; they are essentially equivalent to the C library’s
setjmp/longjmp operations.

In Manticore, we are using continuations in our compiler's IR to 
express concurrency operations. For our prototype implementation, 
we are using heap-allocated continuations à la SML/NJ [App92].
Although heap-allocated continuations impose some extra overhead
(mostly increased GC load) for sequential execution, they provide
a number of advantages for concurrency:

• Creating a continuation just requires allocating a heap object, 
so it is fast and imposes little space overhead (< 100 bytes). 

• Since continuations are values, many nasty race conditions in 
the scheduler can be avoided. 

• Heap-allocated first-class continuations do not have the lifetime 
limitations of escaping and one-shot continuations, so we avoid 
prematurely restricting the expressiveness of our IR. 

• By inlining concurrency operations, the compiler can optimize 
them based on their context of use [FR02].


4 Regehr coined the term “general, heterogeneous schedulers” for similar
scheduler hierarchies [Reg01].

5 The term “escaping continuation” is derived from the fact that they can be 
used to escape. 



3.2 Fibers, threads, and virtual processors 

Our runtime model has three distinct notions of process abstraction. 
At the lowest level, a fiber is an unadorned thread of control, which 
is represented as a unit continuation. 

type fiber = unit cont 

The fiber operator takes a function value, and creates a fiber that, 
when run, calls the function before stopping. 

val fiber : (unit -> unit) -> fiber

Note that this operator can be directly implemented with first-class 
continuations (but not with one-shot continuations). 

A surface-language thread (i.e., one created by spawn) is ini- 
tially mapped to a fiber paired with a unique thread ID (tid). 

type thread = tid * fiber 

In addition to having an ID, threads are differentiated from fibers 
by the fact that they may create additional fibers to run data-parallel 
computations. Thus at run time, a thread consists of a tid and one 
or more fibers. 

Lastly, a virtual processor (vproc) is an abstraction of a hard- 
ware processor resource. The runtime model represents a vproc 
with the vproc type. A vproc runs at most one fiber at a time, 
and furthermore is the only means of running fibers. The vproc for 
the currently running fiber is called the host vproc, and is obtained
by the hostVP operator.

val hostVP : unit -> vproc

The runtime model provides a mechanism for assigning vprocs 
to threads. When applied to the desired number of processors, 
provision returns a list of vprocs that are available for a thread 
(which may be fewer than the number requested). The complemen- 
tary release operator informs the runtime system that a thread is 
finished with some vprocs. 

val provision : int -> vproc list
val release : vproc list -> unit

To balance workload evenly between threads, the runtime system 
never assigns a vproc to a given thread twice. Additionally, the 
runtime system considers load and possibly even processor affinity 
when assigning vprocs. 

3.3 Scheduling infrastructure 

Our scheduling infrastructure is a low-level substrate for writing 
schedulers. It directly encodes all scheduling that occurs at run 
time, and does not rely on external or fixed schedulers. Our ap- 
proach to scheduling is inspired by Shivers’ proposal for exposing 
hardware concurrency using continuations [Shi97], but we have ex- 
tended it to support nested schedulers and multiple processors. To 
support a variety of scheduling disciplines, the infrastructure pro- 
vides mechanisms that divide a vproc’s time among multiple fibers 
and mechanisms that divide and synchronize parallel computations 
among multiple vprocs. The former mechanisms are described in 
detail here. 

A scheduler action is a function that implements context switch- 
ing for a vproc. By defining different functions, we can implement 
different scheduling policies. Scheduler actions have the type 

datatype signal = STOP | PREEMPT of fiber 
type action = signal -> void

where the signal type represents the events that are handled by 
schedulers. Here we have two signals, fiber termination and preemption,
but this type could be extended to model other forms of asynchronous
events, such as asynchronous exceptions [MJMR01]. A
scheduler action should never return, so its result type (void) is 
one that has no values. 


Our model supports nesting of schedulers (e.g., a data-parallel
scheduler runs on top of a thread-level scheduler) by giving each 
vproc a stack of scheduler actions. The top of a vproc’s stack is the 
scheduler action for the current scheduler on that vproc. When a 
vproc receives a signal, it handles it by popping the current sched- 
uler action from the stack and applying it to the signal. Figure 2 
gives a pictorial description of the operations on a vproc’s action 
stack, which we describe below. 

There are two operations that scheduling code can use to di- 
rectly affect the host vproc’s scheduler stack. 

val run : action -> fiber -> void
val forward : signal -> void

The run primitive initiates the execution of a fiber. It takes a 
scheduler action that implements the scheduling policy for the fiber 
and the fiber itself, pushes the action on the scheduler-action stack, 
and then runs the fiber. The expression “forward sig” sends the
signal to the host vproc, which means that the topmost scheduler action
is popped from the stack and applied to sig. Our model uses this
operation to implement the stop function for fiber termination. 

fun stop () = forward STOP 
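
The stack discipline behind run and forward can be illustrated with a small model, written in Python rather than SML. Everything in it is hypothetical: the VProc class, the trace list, and the two example schedulers are inventions of this sketch. It also simplifies in one important way: fibers and actions are plain Python functions that return normally, whereas in the runtime model they are continuations and a scheduler action never returns.

```python
STOP = "STOP"


class VProc:
    """Toy vproc: just a scheduler-action stack plus a trace for inspection."""

    def __init__(self):
        self.actions = []   # top of the stack is the end of the list
        self.trace = []

    def run(self, action, fiber):
        # Push the scheduler action, then start the fiber.
        self.actions.append(action)
        fiber(self)

    def forward(self, signal):
        # Pop the current scheduler action and apply it to the signal.
        action = self.actions.pop()
        action(self, signal)


def thread_sched(vp, signal):
    vp.trace.append(("thread_sched", signal))


def dlp_sched(vp, signal):
    # A nested scheduler that has no more work hands the signal to its parent.
    vp.trace.append(("dlp_sched", signal))
    vp.forward(signal)


def inner_fiber(vp):
    vp.trace.append(("inner_fiber", len(vp.actions)))
    vp.forward(STOP)        # plays the role of stop ()


def outer_fiber(vp):
    vp.run(dlp_sched, inner_fiber)   # nest a second scheduler


vp = VProc()
vp.run(thread_sched, outer_fiber)
```

After the run, vp.trace shows the inner fiber executing with two actions on the stack, then the nested scheduler and finally the outer scheduler receiving the STOP signal, and the stack is empty again.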

Preemption is generated by a hardware event, such as a timer
interrupt. When a vproc is preempted, it reifies the continuation of 
the running fiber k, and then executes “preempt k,” where the 
preempt function is defined as 

fun preempt k = forward (PREEMPT k) 

The vproc then handles the signal as usual; it applies the current
scheduler action to the preemption signal. Using preempt, we can 
define a function that yields the vproc to the current scheduler. 

fun yield () = callcc (fn k => preempt k) 

In addition to the scheduler stack, each vproc also has a queue 
of ready threads. These queues are used to schedule threads, and 
are used as a mechanism to dispatch threads on multiple vprocs. 
There are three operations on these queues: 

val enqueue : thread -> unit
val dequeue : unit -> thread
val enqueueOnProc : (vproc * thread) -> unit

The first two operations apply to the host vproc’s queue. The dequeue
operation blocks the vproc when its queue is empty; the vproc is
unblocked when another vproc puts a thread on its queue. The third
operation puts a thread on another vproc's queue, and is the only
mechanism for parallel dispatch in the runtime model.

To avoid the danger of asynchronous preemption while schedul- 
ing code is running, the forward operation masks preemption and 
the run operation unmasks preemption on the host vproc. We also 
provide operations for explicitly masking and unmasking preemp- 
tion on the host vproc. 

val mask : unit -> unit
val unmask : unit -> unit

3.4 Scheduling language-level threads 

Language-level thread scheduling is round-robin, and is imple- 
mented by the following scheduler action: 

fun switch STOP = dispatch ()
  | switch (PREEMPT k) = (
      enqueue (getTid (), k); dispatch ())
and dispatch () = let
      val (tid, k) = dequeue ()
      in
        setTid tid; run switch k
      end

The dispatch function runs the next thread from the vproc’s
queue and is also used in the implementation of language-level
concurrency operations. Note that it invokes the thread using the
run function with switch as the scheduler. This scheduler action
is the first action on every vproc’s stack. Our infrastructure can also
support more complex time-sharing and priority-based policies,
and can support migration policies, such as work stealing [CR95,
BL99].

Figure 2. How run, forward, and preemption affect a vproc.
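
The round-robin behavior of switch and dispatch can be simulated with Python generators standing in for fibers. This is a sketch under several assumptions: each yield in a generator marks a point where the fiber is preempted (so preemption is cooperative here, not timer-driven), generator exhaustion plays the role of the STOP signal, and the loop simply ends when the ready queue is empty instead of blocking as dequeue does. The names round_robin and worker are hypothetical.

```python
from collections import deque


def round_robin(threads):
    """Run (tid, generator) pairs round-robin; return the order of steps."""
    ready = deque(threads)            # the vproc's ready queue
    trace = []
    while ready:
        tid, k = ready.popleft()      # dispatch: dequeue the next thread
        try:
            step = next(k)            # run it until its next preemption point
        except StopIteration:
            continue                  # STOP: just dispatch the next thread
        trace.append((tid, step))
        ready.append((tid, k))        # PREEMPT k: enqueue, then dispatch
    return trace


def worker(n):
    for i in range(n):
        yield i                       # each yield models one preemption


trace = round_robin([("a", worker(2)), ("b", worker(3))])
```

The two threads alternate until the shorter one finishes, after which the remaining thread gets the vproc to itself.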

3.5 Scheduling data-parallel fibers 

Data-parallel computations require multiple fibers running on mul- 
tiple vprocs. There are a number of different ways to organize this 
computation, but we use a workcrew approach [VR88]. The com- 
piler flattens the nested data parallelism into a flat operation [BG96, 
LCK06], and partitions it into a number of jobs. Each job should 
perform a significant chunk of the total work, employing SIMD 
parallelism when possible. The code in Figure 3 is a function used 
at runtime to schedule the jobs in parallel. It takes the number of 
processors, number of jobs, and a function for computing the ith 
job. The scheduler initializes itself by allocating a group of vprocs, 
and then applying each to the init function. This function takes a 
vproc, and enqueues on it a fiber that installs the scheduler action 
dlpSwitch. 

Once it is initialized on a vproc, the dlpSwitch action ac- 
quires jobs from the work pool, and handles preemptions. The 
STOP signal is used to signal the completion of a job; if there are 
no more jobs available, then we relinquish the host vproc by releasing
the vproc and then stopping. The last vproc to complete a job
does not stop, but instead returns from the forkN function. When
the scheduler receives a PREEMPT signal, it yields control to the 
parent scheduler. At some point in the future, the parent scheduler 
will resume the data-parallel scheduler. If, for example, the parent 
is the language-level thread scheduler, the thread scheduler will re- 
sume the data-parallel scheduler once it cycles through its ready 
queue. In this way, the host vproc can be multiplexed among both 
data-parallel and explicit-parallel computations. 
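
The control structure of the workcrew scheduler (a shared job counter, a completion counter, and a last-worker-continues rule) can be sketched with ordinary threads. This Python version is an approximation under stated assumptions: OS threads replace vprocs and fibers, a lock emulates the atomic fetchAndAdd, an Event replaces throwing to the doneK continuation, preemption handling is omitted, and all n_procs workers are assumed to start (whereas provision may return fewer vprocs than requested). The names fork_n and square are hypothetical.

```python
import threading


def fork_n(n_procs, n_jobs, job):
    """Run job(0) .. job(n_jobs - 1) on n_procs workers, workcrew style."""
    if n_procs <= 0:
        raise ValueError("need at least one worker")
    lock = threading.Lock()
    counters = {"next": 0, "done": 0}
    all_done = threading.Event()

    def fetch_and_add(key, delta):
        # Emulates an atomic fetch-and-add with a lock.
        with lock:
            old = counters[key]
            counters[key] = old + delta
            return old

    def worker():
        while True:
            i = fetch_and_add("next", 1)     # claim the next job index
            if i >= n_jobs:
                break                        # work pool is empty
            job(i)
        # The last worker to run out of jobs signals overall completion.
        if fetch_and_add("done", 1) == n_procs - 1:
            all_done.set()

    workers = [threading.Thread(target=worker) for _ in range(n_procs)]
    for w in workers:
        w.start()
    all_done.wait()
    for w in workers:
        w.join()


results = [None] * 10


def square(i):
    results[i] = i * i      # each job writes its own slot


fork_n(3, 10, square)
```

Because workers pull job indices from a shared counter, faster workers naturally take on more jobs, which is the load-balancing property that motivates the workcrew organization.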

3.6 Other scheduling disciplines 

Our infrastructure is general enough to implement a wide variety 
of schedulers and we have sketched implementations of a num- 
ber of different mechanisms [Rai07]. These include engines [HF84] 
and nested engines [DH89], which are an elegant mechanism that
provides timed preemption for a collection of threads. Other examples
include work stealing [MKH90, CHRR95, BL99] and wait-free
cache-affinity work stealing [KD03]. We are also implementing
schedulers that can adaptively provision vprocs via an extension
to our signaling mechanism that is similar to scheduler activations
[ABLL92].

In the long run, we believe that application-specific scheduling 
policies may be an important tool in maximizing parallel perfor- 
mance. Since implementing scheduling and load-balancing policies 
in general-purpose languages is error prone and tedious, we plan 
to explore domain-specific languages for programming schedulers. 
For example, the Bossa scheduler language ameliorates implementation
difficulties by using specialized abstractions for expressing
policies and static checks of those policies [MLD05].

fun forkN (nProcs, nJobs, job : int -> unit) =
      callcc (fn doneK => let
        val (cnt, done) = (ref 0, ref 0)
        fun dlpSwitch STOP = let
              val nextJob = fetchAndAdd (cnt, 1)
              in
                if (nextJob < nJobs) then
                  run dlpSwitch
                    (fiber (fn () => job nextJob))
                else if (fetchAndAdd (done, 1) =
                         nProcs-1)
                  then throw doneK ()
                  else (
                    release [hostVP ()];
                    stop ())
              end
          | dlpSwitch (PREEMPT k) = (
              yield ();
              run dlpSwitch k)
        fun init vp = enqueueOnProc (vp,
              (getTid (),
               fiber (fn () =>
                 run dlpSwitch (fiber stop))))
        in
          List.app init (provision nProcs);
          stop ()
        end)

Figure 3. Creating fibers for data-parallel computations

4. Multiprocessor CML 

Concurrent ML is embedded in Standard ML of New Jersey. It is
implemented using the first-class continuations of SML/NJ and is
inherently single threaded [Rep91, Rep99]. Thus, we are faced with
developing a new multi-threaded implementation of CML's primitives
suitable for modern multicore processors. The main challenge
is the implementation of event synchronization, which involves a 
form of distributed agreement. In the general case, an event con- 
sists of a choice of channel communications and synchronization 
involves picking one of the enabled communications in the choice 
and executing it. What makes this problem difficult is that the other 
party involved in the communication may itself be involved in a 
choice, so we need a protocol that guarantees that both parties agree 
on the communication. Furthermore, a choice may involve multiple 
operations on the same channel, which makes deadlock avoidance 
a bit tricky. 

The single-threaded implementation achieves this agreement by 
executing the following steps atomically [Rep99]:

• Poll the communication operations (e.g., sends and recvs) in
the choice to see if they are enabled.

• If one or more operations are enabled, pick one and do it. 

• Otherwise, enqueue continuations for each of the choices and 
dispatch another thread. 

The single-threaded implementation relies on the global lock for 
correctness and, since there is only one processor, it does not hurt 
performance, but in a parallel implementation the global lock is a 
bottleneck. 

An obvious first step is to give each channel its own lock, but 
to avoid deadlock when there are multiple operations on the same 
channel, we must either release the lock after polling the channel 
or use reentrant locks. We explored this approach, but found that
the implementation was complicated. Instead, we have designed an
optimistic protocol for implementing choice that has the following 
steps: 

• First we poll channels for possible communications (which can
be done in a lock-free way).

• If there are available communications, attempt to commit to 
one of them. This commit may fail because another thread has 
“stolen” the communication. 

• If there are no available communications (or all attempts to 
commit failed), we block the thread on the channels. In this 
process, we may discover that a communication has become 
available, in which case we commit to it. 

This protocol is optimized for the common case where a given
channel is not shared between more than two threads, but it remains 
to be seen how well it works when there is contention for a shared 
channel. 
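
The first two steps of such an optimistic scheme (poll without locking, then try to commit and tolerate failure) can be sketched as follows. This is one possible reading of the steps listed above, not the actual Manticore protocol, whose details are not given here. The Offer and Channel classes, the poll and commit functions, and the use of a small lock to emulate an atomic compare-and-swap are all assumptions of the sketch; only the receiving side of a choice is modeled, and the blocking step is left out.

```python
import threading


class Offer:
    """A pending send on a channel; it can be claimed exactly once."""

    def __init__(self, value):
        self.value = value
        self.claimed = False
        self._guard = threading.Lock()   # emulates compare-and-swap

    def try_claim(self):
        with self._guard:
            if self.claimed:
                return False             # someone else got there first
            self.claimed = True
            return True


class Channel:
    def __init__(self):
        self.offers = []                 # pending send offers


def poll(channels):
    # Step 1: read-only scan for offers that look available.
    return [(ch, o) for ch in channels
            for o in list(ch.offers) if not o.claimed]


def commit(candidates):
    # Step 2: try to commit to one candidate; a claim may fail because
    # another thread has "stolen" the communication since the poll.
    for ch, offer in candidates:
        if offer.try_claim():
            return ch, offer.value
    return None                          # caller would now have to block


a, b = Channel(), Channel()
a.offers.append(Offer("from a"))
b.offers.append(Offer("from b"))

candidates = poll([a, b])
a.offers[0].try_claim()                  # simulate a competing thread
winner = commit(candidates)              # a's offer is gone, b succeeds
retry = commit(poll([a, b]))             # nothing left to commit to
```

The example forces a stolen communication between the poll and the commit: the commit on channel a fails, the choice falls through to channel b, and a second attempt finds nothing available, which is the point at which the third step of the protocol would block the thread.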

Another aspect of our approach to implementing message pass- 
ing is the development of a program analysis for detecting spe- 
cial patterns of channel usage. For example, our analysis can detect 
when a channel is only used by a single sender and single receiver 
in non-choice contexts. In such a case, the channel operations can 
be implemented using a single atomic compare-and-swap instruction,
which is much faster than the general protocol. This program
analysis and optimization technique, which we are implementing 
as part of the Manticore compiler, is discussed in full detail elsewhere
[Xia05, RX07].

5. Status 

We are currently working on an initial implementation of Manti- 
core that will provide a testbed for future research in both language 
design and implementation techniques. Our initial implementation 
targets the 32- and 64-bit versions of the x86 architecture on Linux
and we hope to have a public release ready by the Spring of 2007. 
This effort is proceeding along two tracks. 

The first is a compiler and interpreter for the Manticore lan- 
guage. To speed the construction of this prototype, we have ex- 
tended the HaMLet SML compiler [Ros] with syntax for our 
data-parallel and concurrency operations. We are using HaMLet as 
both a parser/typechecker for our compiler and as a source-to-source
translator that converts Manticore programs to CML programs. Although
this translator does not support parallelism, it does allow us
to gain experience with programming in Manticore. For our com- 
piler, we are using the MLRISC framework for code generation 
and register allocation [GGR94, GA96].

We have also implemented a prototype of the runtime scheduler 
infrastructure described in Section 3. The implementation is writ- 
ten in C on Linux, with each vproc being represented by a POSIX 
thread. We are using this framework to gauge both the expressive- 
ness of our model for writing schedulers and their performance. 
So far, we have implemented several data-parallel examples, but 
we have not yet made any performance measurements. We are also 
integrating the multiprocessor implementation of CML described 
above into the Manticore runtime system. Lastly, the runtime sys- 
tem model has been formalized as a parallel CEK machine [Rai07], 
which will provide a guide for both the compiler and runtime im- 
plementation efforts. 